Thompson Sampling for Monte Carlo Tree Search and Maxi-min Action Identification
Abstract
The Multi-Armed Bandit (MAB) problem is named after slot machines. A player facing several machines has to decide which machines to play, in which order, and how many times to play each. Each pull of a machine yields a random reward drawn from that machine's probability distribution, and the player's goal is to maximize the sum of rewards earned over a sequence of lever pulls. To learn the distribution of each machine as quickly as possible while earning as much reward as possible, we consider the popular Thompson Sampling (TS) method, which is based on Bayesian ideas. TS is a heuristic for choosing actions that maximize the expected reward with respect to a randomly drawn belief.[9] In the first half of this thesis, we test the performance of TS and compare a variant, Top-two Thompson Sampling (TTTS), with standard TS, based on uniform sampling. TTTS is computationally slow, so we also propose another algorithm, Top-two Gibbs Thompson Sampling, which combines TTTS with Gibbs sampling and improves on the computation speed of TTTS.
In the second half of the thesis, we take the application of TS a step further by combining it with the Maximin Action Identification (MAI) problem. Maximin is a concept from two-player zero-sum games in game theory. The main idea behind maximin action selection is a constant alternation between minimizing and maximizing the value of moves to account for an adversarial opponent. Existing maximin methods are associated with games such as Checkers, Chess and Go. We try to broaden its area of application by applying it to the new algorithms created in the first half. First, with a limited time budget, we test performance under different divisions of that budget; afterwards, we create a new algorithm that picks the right arm in one step across two layers. The results show no significant difference between the time division method, which assigns even time budgets to the child layer and no budget to the parent layer, and the Maximin Thompson Sampling method.
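The ideas named in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes Bernoulli rewards with a uniform Beta(1, 1) prior on each arm, a leader probability `beta = 0.5` for the top-two rule, and a known table of mean rewards for the maximin choice. All function names (`thompson_step`, `top_two_step`, `run_bandit`, `maximin_action`) are hypothetical, and the Gibbs-based variant is not shown.

```python
import random


def thompson_step(successes, failures, rng):
    """Sample each arm's Beta posterior once and return the arm with the largest draw."""
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)


def top_two_step(successes, failures, rng, beta=0.5, max_resamples=1000):
    """Top-two rule (assumed form): play the TS leader with probability beta,
    otherwise resample until a different arm (the challenger) comes out on top."""
    leader = thompson_step(successes, failures, rng)
    if rng.random() < beta:
        return leader
    for _ in range(max_resamples):
        challenger = thompson_step(successes, failures, rng)
        if challenger != leader:
            return challenger
    return leader  # fallback if no challenger was found within the resample cap


def run_bandit(true_means, horizon, step, seed=0):
    """Simulate `horizon` pulls on Bernoulli arms using the given selection rule."""
    rng = random.Random(seed)
    k = len(true_means)
    successes, failures = [0] * k, [0] * k
    for _ in range(horizon):
        arm = step(successes, failures, rng)
        if rng.random() < true_means[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures


def maximin_action(means):
    """means[i][j] is the mean reward when we play i and the opponent replies j.
    Return the action i whose worst-case (minimum over j) mean is largest."""
    return max(range(len(means)), key=lambda i: min(means[i]))


if __name__ == "__main__":
    s, f = run_bandit([0.2, 0.5, 0.8], horizon=500, step=thompson_step)
    print("pulls per arm (TS):  ", [a + b for a, b in zip(s, f)])
    s, f = run_bandit([0.2, 0.5, 0.8], horizon=500, step=top_two_step)
    print("pulls per arm (TTTS):", [a + b for a, b in zip(s, f)])
    print("maximin action:", maximin_action([[0.9, 0.1], [0.5, 0.4]]))
```

In this sketch the challenger is found by repeated resampling, which can take many draws once the posterior concentrates on one arm. That is one plausible reading of why the abstract calls TTTS slow, but the thesis itself would need to be consulted for the actual cause.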
Related resources
Thompson Sampling Based Monte-Carlo Planning in POMDPs
Monte-Carlo tree search (MCTS) has been drawing great interest in recent years for planning under uncertainty. One of the key challenges is the tradeoff between exploration and exploitation. To address this, we introduce a novel online planning algorithm for large POMDPs using Thompson sampling based MCTS that balances between cumulative and simple regrets. The proposed algorithm — Dirichlet-Di...
Bayesian Mixture Modeling and Inference based Thompson Sampling in Monte-Carlo Tree Search
Monte-Carlo tree search (MCTS) has been drawing great interest in recent years for planning and learning under uncertainty. One of the key challenges is the trade-off between exploration and exploitation. To address this, we present a novel approach for MCTS using Bayesian mixture modeling and inference based Thompson sampling and apply it to the problem of online planning in MDPs. Our algorith...
Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search
Monte-Carlo tree search (MCTS) has been drawing great interest in recent years for planning and learning under uncertainty. One of the key challenges is the trade-off between exploration and exploitation. To address this, we present a novel approach for MCTS using Bayesian mixture modeling and inference based Thompson sampling and apply it to the problem of online planning in MDPs. Our algorith...
Convolutional Monte Carlo Rollouts in Go
In this work, we present an MCTS-based Go-playing program which uses convolutional networks in all parts. Our method performs MCTS in batches, explores the Monte Carlo search tree using Thompson sampling and a convolutional network, and evaluates convnet-based rollouts on the GPU. We achieve strong win rates against open source Go programs and attain competitive results against state of the art ...
Efficient Sampling Method for Monte Carlo Tree Search Problem
We consider the Monte Carlo tree search problem, a variant of the Min-Max tree search problem where the score of each leaf is the expectation of some Bernoulli variables and is not explicitly given but can be estimated through (random) playouts. The goal of this problem is, given a game tree and an oracle that returns an outcome of a playout, to find a child node of the root which attains an approximate m...